------------------------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  E:\REStat_MS14767_Vol96(2)\Data preparation Compustat segment\9_match_cq.log
  log type:  text
 opened on:  19 Dec 2014, 18:59:56

. *-----------------------------------------------------------------
. 
. *****************************************
. * starting data2.dta to master_compustat_all_countries.dta
. *
. * This file reconstructs the connection between match.dta and compustat.dta
. * so that the data from compustat dataset may later be merged with patent data.
. * It is equivalent to the patents_match.do file.
. * Match.dta contains 5000 entries.
. **************************************************
. 
. set mem 600m

. 
. use "compustat+segment.dta", clear

. keep ticker cusip name dataset

. so ticker

. drop if ticker==ticker[_n-1]
(151949 observations deleted)

. 
. 
. 
. ****************************************
. * In order to find matches within match.dta, 9-digit cusip is changed to 6-digit cusip
. * and renamed. cusip9 is the original cusip.
. ************************************
. 
. rename cusip cusip9

. gen cusip= substr(cusip9, 1, length(cusip9) - 3)

. 
. label var cusip "6-digit cusip from compustat"

. label var cusip9 "9-digit cusip from compustat" 

. 
. *egen dataset2 = group(dataset)
. 
. 
. 
. sort cusip

. egen comnum = group(ticker)

. egen x = count(comnum), by(cusip)

. 
. 
. save "patent files\segment_master_red.dta", replace
file patent files\segment_master_red.dta saved

. *egen x2 = seq(), by(cusip)
. *reshape wide ticker name, i(cusip) j(x2)
. 
. *save "patent files\segment_master_red_wide.dta", replace
. 
. 
. 
. 
. ***************************
. * match.dta has cusip as identifier, contains 5,000 entries compared to 
. * 23,000 entries in compustat
. ******************************
. 
. use "patent files\patents\match.dta", clear
file patent files\patents\match.dta not found
r(601);

end of do-file
r(601);

end of do-file

r(601);

. use "patent files\patents\match.dta", clear

. 
. 
. 
. 
. 
. joinby cusip using "patent files\segment_master_red.dta", unmatched(both) 

. 
. 
. 
. tab _merge

                       _merge |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
          only in master data |      1,862        8.68        8.68
           only in using data |     16,537       77.08       85.76
both in master and using data |      3,054       14.24      100.00
------------------------------+-----------------------------------
                        Total |     21,453      100.00

. 
. 
. 
. ******************

. 
. *------------------------+-----------------------------------

. 
. *       master only (1) |      1,829        8.53        8.53

. 
. *         using only (2) |     16,542       77.13       85.65

. 
. *            matched (3) |      3,077       14.35      100.00

. 
. *------------------------+-----------------------------------

. 
. *                  Total |     21,448      100.00

. 
. *

. 
. **************************************

. 
. 
. 
. drop if _merge==2
(16537 observations deleted)

. 
. br if x>1 &x !=. & cname !=""

. 
. 
. 
. *Here some manual adjustment are needed,:

. 
. * for the merge we used cusip6  which does not uniquely identify 

. 
. * the Firm:

. 
. 
. 
. *cusiptickernamecname

. 
. *372917GZSPGENZYME SURGICAL PRODUCTSGENZYME CORP

. 
. *372917GZBXGENZYME BIOSURGERYGENZYME CORP

. 
. *372917GZTRGENZYME TISSUE REPAIRGENZYME CORP

. 
. *372917GENZGENZYME CORPGENZYME CORP

. 
. *372917GZMOGENZYME MOLECULAR ONCOLOGYGENZYME CORP

. 
. *373298TGP.1GEORGIA-PACIFIC TIMBER COGEORGIA-PACIFIC CORP

. 
. *373298GP.1GEORGIA-PACIFIC CORPGEORGIA-PACIFIC CORP

. 
. *747906HDDQUANTUM CORP HDDGQUANTUM CORP

. 
. *747906QTMQUANTUM CORPQUANTUM CORP

. 
. *747906HDDQUANTUM CORP HDDGQUANTUM CORP

. 
. *747906QTMQUANTUM CORPQUANTUM CORP

. 
. *254687DIG.3DISNEY (WALT) INTERNET GROUPDISNEY (WALT) COMPANY

. 
. *254687DISDISNEY (WALT) CODISNEY (WALT) COMPANY

. 
. *254687DISDISNEY (WALT) CODISNEY (WALT) COMPANY

. 
. *254687DIG.3DISNEY (WALT) INTERNET GROUPDISNEY (WALT) COMPANY

. 
. *373298TGP.1GEORGIA-PACIFIC TIMBER COGEORGIA-PACIFIC CORP

. 
. *373298GP.1GEORGIA-PACIFIC CORPGEORGIA-PACIFIC CORP

. 
. 
. 
. 
. 
. * I took those where the cname is equal or almost equal to the name variable:

. 
. 
. 
. drop if ticker == "GZSP"|ticker =="GZBX"|ticker=="GZTR"|ticker == "GZMO"
(4 observations deleted)

. 
. drop if ticker =="TGP.1"
(2 observations deleted)

. 
. drop if ticker == "HDD"
(2 observations deleted)

. 
. drop if ticker == "DIG.3"
(2 observations deleted)

. 
. 
. 
. 
. 
. gen cusip_match=0

. 
. replace cusip_match=1 if _merge==3
(3044 real changes made)

. 
. drop _merge x comnum

. 
. 
. 
. 
. 
. gen cusip_name=name
(1862 missing values generated)

. 
. replace name=cname
name was str29 now str30
(2056 real changes made)

. 
. rename cusip9 cusip_cusip9

. 
. 
. 
. 
. 
. 
. 
. *label var cname "name in compustat data set"

. 
. label var cusip "unique identifier"

. 
. 
. 
. label var assignee "assignee identi in compustat data set"

. 
. label var pname "name of parent firm"

. 
. label var sname "name of subsidary firm"

. 
. label var assname "patent assignee name "

. 
. label var cusip_cusip9 "9-digit cusip number"

. 
. label var cusip_name "original compustat+segment name"

. 
. label var cusip_match "valid cusip connection between match and compustat"

. 
. 
. 
. desc

Contains data
  obs:         4,906                          
 vars:            13                          
 size:     1,314,808                          
------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------------
assignee        long    %12.0g                assignee identi in compustat data set
assname         str59   %59s                  patent assignee name
cname           str30   %30s                  
cusip           str6    %9s                   unique identifier
own             byte    %8.0g                 
pname           str33   %33s                  name of parent firm
sname           str50   %50s                  name of subsidary firm
ticker          str8    %9s                   Ticker Symbol
dataset         str6    %9s                   Indicates the origin of Obvervation
cusip_cusip9    str9    %9s                   9-digit cusip number
name            str30   %30s                  
cusip_match     float   %9.0g                 valid cusip connection between match and compustat
cusip_name      str28   %28s                  original compustat+segment name
------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

. 
. 
. 
. save "patent files\cusipmerged.dta", replace
file patent files\cusipmerged.dta saved

. 
. clear

. 
. 
. 
. *****************************

. 
. * This was merging cusip, now comes merging names with match.dta (now cusipmerged.dta)

. 
. *******************************

. 
. 
. 
. use "patent files\segment_master_red.dta", clear

. 
. drop cusip comnum x

. 
. save "patent files\segment_master_red.dta", replace
file patent files\segment_master_red.dta saved

. 
. 
. 
. 
. 
. use "patent files\cusipmerged.dta", clear

. 
. joinby name using "patent files\segment_master_red.dta", unmatched(both) 

. 
. 
. 
. tab _merge

                       _merge |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
          only in master data |      2,140        9.91        9.91
           only in using data |     16,695       77.29       87.20
both in master and using data |      2,766       12.80      100.00
------------------------------+-----------------------------------
                        Total |     21,601      100.00

. 
. drop if _merge==2
(16695 observations deleted)

. 
. gen name_match=0

. 
. replace name_match=1 if _merge==3
(2766 real changes made)

. 
. drop _merge

. 
. rename cusip9 name_cusip9

. 
. count if cusip_match==0 & name_match==0
 1763

. 
. 
. 
. label var name_match "valid name connection between match and compustat"

. 
. 
. 
. save "patent files\cusipnamemerged.dta", replace
file patent files\cusipnamemerged.dta saved

. 
. 
. 
. count if name_match==1
 2766

. 
. count if cusip_match==1
 3044

. 
. 
. 
. ****************************

. 
. * Now the dataset contains 3003 name-matches and 3231 cusip-matches, 1589 non-matched***

. 
. **********************************

. 
. 
. 
. insheet using "patent files\manual_matching_cq_ip.csv", clear
(5 vars, 174 obs)

. 
. save "patent files\manual_matching_cq_ip.dta", replace
file patent files\manual_matching_cq_ip.dta saved

. 
. drop in 1/1
(1 observation deleted)

. 
. rename v1 manual_name

. 
. rename v2 manual_cusip9

. 
. rename v3 cname

. 
. rename v4 cusip

. 
. rename v5 error

. 
. so manual_name

. 
. save "patent files\manual_matching_cq.dta", replace
file patent files\manual_matching_cq.dta saved

. 
. 
. 
. use "patent files\segment_master_red.dta", clear

. 
. 
. 
. keep name  cusip9

. 
. rename name manual_name

. 
. rename cusip9 manual_cusip9

. 
. joinby manual_name using "patent files\manual_matching_cq.dta", unmatched(both) 

. 
. tab _merge

                       _merge |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
          only in master data |     17,869       99.04       99.04
           only in using data |         50        0.28       99.32
both in master and using data |        123        0.68      100.00
------------------------------+-----------------------------------
                        Total |     18,042      100.00

. 
. keep if _merge==3
(17919 observations deleted)

. 
. sort manual_name

. 
. keep manual_name manual_cusip9 cname

. 
. 
. 
. save "patent files\manual_matching_cq.dta", replace
file patent files\manual_matching_cq.dta saved

. 
. 
. 
. use "patent files\cusipnamemerged.dta", clear

. 
. 
. 
. joinby cname using "patent files\manual_matching_cq.dta", unmatched(both) 

. 
. gen manual=0

. 
. replace manual=1 if _merge==3 & manual_name~="n"
(271 real changes made)

. 
. tab _merge

                       _merge |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
          only in master data |      4,638       94.48       94.48
both in master and using data |        271        5.52      100.00
------------------------------+-----------------------------------
                        Total |      4,909      100.00

. 
. drop _merge

. 
. 
. 
. replace cname=manual_name if manual==1
(271 real changes made)

. 
. drop manual_name

. 
. replace cusip=manual_cusip9 if manual==1
cusip was str6 now str9
(271 real changes made)

. 
. drop manual_cusip

. 
. replace cname=cusip_name if cusip_match==1 & name_match==0
(377 real changes made)

. 
. drop cusip_name

. 
. replace cusip=name_cusip9 if cusip_match==0 & name_match==1
(99 real changes made)

. 
. drop name_cusip9

. 
. replace cusip=cusip_cusip9 if cusip_match==1
(3044 real changes made)

. 
. drop cusip_cusip9 

. 
. 
. 
. **********************************

. 
. * The following steps prepare and examine the merge between compustat and cusipnamemerged

. 
. * in order to append assignee numbers and names to compustat for later merge with patents

. 
. **********************************

. 
. 
. 
. so assignee

. 
. drop if assignee==assignee[_n-1] & cusip == cusip[_n-1]
(8 observations deleted)

. 
. drop if cusip_match==0 & name_match==0 & manual==0
(1501 observations deleted)

. 
. 
. 
. save "patent files\cusipnamemerged.dta", replace
file patent files\cusipnamemerged.dta saved

. 
. 
. 
. *******************************************************

. 
. 
. 
. 
. 
. use "patent files\segment_master_red.dta", clear

. 
. rename cusip9 cusip

. 
. 
. 
. 
. 
. merge 1:m cusip using "patent files\cusipnamemerged.dta"
(note: variable name was str29, now str30 to accommodate using data's values)

    Result                           # of obs.
    -----------------------------------------
    not matched                        16,418
        from master                    16,418  (_merge==1)
        from using                          0  (_merge==2)

    matched                             3,400  (_merge==3)
    -----------------------------------------

. 
. keep if _merge==3
(16418 observations deleted)

. 
. tab _merge

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
            matched (3) |      3,400      100.00      100.00
------------------------+-----------------------------------
                  Total |      3,400      100.00

. 
. drop _merge

. 
. 
. 
. so cusip

. 
. count if cusip!=cusip[_n-1]
 1564

. 
. 
. 
. *************************************************************

. 
. * Here it is checked for non-american entries, 

. 
. * those are dropped.

. 
. ******************************************************

. 
. 
. 
. 
. 
. 
. 
. save "patent files\cusipnamemerged.dta", replace
file patent files\cusipnamemerged.dta saved

. 
. 
. 
. keep assignee cusip ticker 

. 
. so assignee

. 
. desc

Contains data from patent files\cusipnamemerged.dta
  obs:         3,400                          
 vars:             3                          19 Dec 2014 19:10
 size:        71,400                          
------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------------
ticker          str8    %9s                   Ticker Symbol
cusip           str9    %9s                   9-digit cusip from compustat
assignee        long    %12.0g                assignee identi in compustat data set
------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  assignee
     Note:  dataset has changed since last saved

. 
. save "patent files\cusipnamemerged_red.dta", replace
file patent files\cusipnamemerged_red.dta saved

. 
. 
. 
. desc

Contains data from patent files\cusipnamemerged_red.dta
  obs:         3,400                          
 vars:             3                          19 Dec 2014 19:10
 size:        71,400                          
------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------------
ticker          str8    %9s                   Ticker Symbol
cusip           str9    %9s                   9-digit cusip from compustat
assignee        long    %12.0g                assignee identi in compustat data set
------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  assignee

. 
. 
. 
. log close
      name:  <unnamed>
       log:  E:\REStat_MS14767_Vol96(2)\Data preparation Compustat segment\9_match_cq.log
  log type:  text
 closed on:  19 Dec 2014, 19:10:43
------------------------------------------------------------------------------------------------------------------------------------------
